In this notebook you fit a Gradient Boosting Machine (GBM) model using R, and then publish the model as a web service on the Azure Machine Learning platform.
You should have some experience with R and a basic familiarity with Azure ML web services.
GBM is well known among data scientists and, as many Kaggle write-ups explain, it has several major advantages compared with traditional statistical models like linear regression.
For users who are used to fitting GBM models in Azure ML Experiments, a major advantage of Azure ML notebooks is that they offer more modeling options. For example, when the response variable is continuous you can use the "Boosted Decision Tree Regression" module in Experiments to fit a GBM model. That module, however, does not let you specify the loss function (for statisticians, this means you can't specify the distribution of the response variable). With the gbm package in R, on the other hand, you can choose from a wide variety of loss functions.
In this example, you use the housing data from the R package MASS. The dataset has 506 rows and 14 columns, including the median home price, the average number of rooms per dwelling, the crime rate by town, and so on. You can find more information about this dataset by typing help(Boston) or ?Boston in an R terminal, or on the dataset's page at the UCI Machine Learning Repository.
In [1]:
library(MASS) # to use the Boston dataset
?Boston
Out[1]:
A GBM model has several hyperparameters, and we need to estimate them first. One way to do so is cross validation over a parameter grid. In this example, the parameters to optimize are the number of trees (n.trees), the tree depth (interaction.depth), the minimum number of observations in a terminal node (n.minobsinnode), and the learning rate (shrinkage). We start by providing several candidate values for each parameter and form all of their combinations, each combination consisting of one value per parameter. For each combination we then use cross validation to estimate performance, with root mean squared error (RMSE) as the metric. The caret package can be used for this process.
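As a minimal sketch of how such a grid search could look with caret (the grid values below are illustrative assumptions, not tuned recommendations):

if(!require("caret")) install.packages("caret")
library(caret)
library(MASS)  # for the Boston data

set.seed(123)
# illustrative candidate values for each hyperparameter
tune_grid <- expand.grid(n.trees = c(1000, 5000, 10000),
                         interaction.depth = c(2, 4),
                         n.minobsinnode = c(1, 10),
                         shrinkage = c(0.001, 0.01))
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross validation
gbm_tuned <- train(medv ~ ., data = Boston, method = "gbm",
                   trControl = ctrl, tuneGrid = tune_grid,
                   metric = "RMSE", verbose = FALSE)
gbm_tuned$bestTune  # best combination found by cross validation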
In [2]:
# load the libraries
if(!require("gbm")) install.packages("gbm")
library(gbm)
In [3]:
# fit a first GBM model
model1 <- gbm(medv ~ ., data = Boston,
              distribution = "gaussian",  # squared error loss
              n.trees = 5000,             # number of trees
              interaction.depth = 2,      # depth of each tree
              n.minobsinnode = 1,         # minimum observations in a terminal node
              shrinkage = 0.001)          # learning rate
In [4]:
# summarize the model
options(repr.plot.width = 4, repr.plot.height = 4)
summary(model1)
Out[4]:
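summary on a gbm object both plots and returns the relative influence of each predictor; if you only want the table, the plotit argument of summary.gbm skips the plot:

# relative influence of each predictor, without the bar plot
rel_inf <- summary(model1, plotit = FALSE)
head(rel_inf)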
In [5]:
# marginal effect (partial dependence) plot; by default for the first predictor
plot(model1)
In [6]:
# fit a second model with deeper trees and a larger learning rate
model2 <- gbm(medv ~ ., data = Boston,
              distribution = "gaussian",
              n.trees = 10000,
              interaction.depth = 4,
              n.minobsinnode = 1,
              shrinkage = 0.01)
summary(model2)
Out[6]:
For the fitted model, we can look closely at how the number of trees affects the loss function on the training and validation data, and select the best value.
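A minimal sketch of one way to do this, assuming we refit with cv.folds enabled so that cross-validation error is available (model2 above was fit without it):

# refit with 5-fold cross validation so gbm.perf() can compare
# training and CV error across the number of trees
model2_cv <- gbm(medv ~ ., data = Boston,
                 distribution = "gaussian",
                 n.trees = 10000,
                 interaction.depth = 4,
                 n.minobsinnode = 1,
                 shrinkage = 0.01,
                 cv.folds = 5)

# plots the error curves and returns the optimal number of trees
best_iter <- gbm.perf(model2_cv, method = "cv")
best_iter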
In [7]:
# load the AzureML package
library(AzureML)

# workspace information (the workspace id and authorization token
# are read from your Azure ML settings)
ws <- workspace()

# define the prediction function to publish
predict_gbm <- function(newdata){
  require(gbm)
  predict(model2, newdata, n.trees = 1000)
}

# test the prediction function locally
newdata <- Boston[1:10, ]
pred <- predict_gbm(newdata)
data.frame(actual = newdata$medv, prediction = pred)
Out[7]:
In [8]:
# publish the service
ep <- publishWebService(ws = ws, fun = predict_gbm,
                        name = "HousePricePredictionGBM",
                        inputSchema = newdata)
str(ep)
In [9]:
# call the published web service on the same sample data
pred <- consume(ep, newdata)$ans
data.frame(actual = newdata$medv, prediction = pred)
Out[9]:
Using the Boston housing dataset, we started the analysis by estimating the hyperparameters of the GBM model. We then fitted the model and examined variable importance. Finally, a web service was deployed based on the selected model.
In addition to the Gaussian distribution, which uses a squared error loss, the gbm package supports several other distributions: "laplace", which uses absolute loss; "tdist", which uses t-distribution loss; and more.
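For instance, refitting the earlier model with absolute loss is just a matter of changing the distribution argument; a quick sketch:

# same model as before, but with absolute (Laplace) loss
model_laplace <- gbm(medv ~ ., data = Boston,
                     distribution = "laplace",
                     n.trees = 10000,
                     interaction.depth = 4,
                     n.minobsinnode = 1,
                     shrinkage = 0.01)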
The caret package makes it easy to tune these hyperparameters on a grid.
Created by a Microsoft Employee.
Copyright (C) Microsoft. All Rights Reserved.